
PoC: server handling multiple clients with custom attention mask api #3490

Closed
wants to merge 22 commits

Conversation

FSSRepo
Collaborator

@FSSRepo FSSRepo commented Oct 5, 2023

Continuing from #3462: I wanted to update my fork to the latest changes from the master branch, but it went wrong :(.

Hello, I know it's something no one asked for, but some of us need it. Here's a proof of concept of a server that handles multiple clients simultaneously, thanks to the new way of working with the KV cache.

Some may wonder why this proposal is reimplemented in a separate example. The current implementation of the server is quite complex, and many things could break.

Tested on:

Windows 11 x64
Intel Core i5 11400H 6 C / 12 T
RTX 3050 laptop 4 GB VRAM
16 GB of RAM DDR4 3200MHz
Server.Parallel.Improvements.mp4

This is a proof of concept for now; with some feedback and assistance, we could make it more usable.

Here is the command to start the server:

./server-parallel -m models/7B/ggml-model.gguf --ctx_size 2048 -t 4 -ngl 33 --batch-size 512 --parallel 3 -n 512 --cont-batching

Set --parallel to the number of slots used to process client requests.
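For instance, here is a quick way to exercise several slots at once by firing concurrent requests (a sketch, not part of this PR; the /completion endpoint comes from the frontend snippet later in this thread, and the "prompt"/"n_predict" body fields are assumptions about the request format):

// Sketch: three concurrent clients against the PoC server.
// Assumption: POST /completion accepting { prompt, n_predict } JSON.
// Run as an ES module (top-level await) or wrap in an async function.
const prompts = ["Hello!", "Tell me a joke.", "What is llama.cpp?"];

const results = await Promise.all(
  prompts.map((prompt) =>
    fetch("http://localhost:8080/completion", {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ prompt, n_predict: 64 }),
    }).then((res) => res.text())
  )
);

console.log(results); // one completion per slot, generated in parallel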

Edit:

  • New video showing 4 clients at the same time; my laptop almost exploded 😂.
  • Improved the PR note.

Latest changes:

  • Improved README
  • Use a custom system prompt and change it at runtime

Note:

Many people are going to want to kill me when they see how I handle multithreading without using a mutex; I never knew what that was for :(.

@ggerganov
Owner

Works great! Here is another demo on M1 Pro with F16 7B:

server-parallel-0.mp4

@FSSRepo Can you allow pushes to your branch so I can push some fixes:

$ ▶ git push FSSRepo HEAD:fixes 
error: Authentication error: Authentication required: You must have push access to verify locks
error: failed to push some refs to 'https://github.com/FSSRepo/llama.cpp'

server-parallel : add "--reverse-prompt" + compiler warning fixes
@FSSRepo
Collaborator Author

FSSRepo commented Oct 6, 2023

@FSSRepo Can you allow pushes to your branch so I can push some fixes:

Screenshot 2023-10-06 094305

I don't have any restrictions on that branch.

@ggerganov, can you test a fully offloaded model on your Mac, please, to confirm whether there is a bug in the faster-generation scenario?

@kiratp

kiratp commented Oct 6, 2023

First off - this is awesome! Thank you @FSSRepo!

I am going to take the shameless opportunity to request that support for speculative execution be considered here; that would make this the first and only OSS LLM server I have come across that supports it out of the box!

@Seikho

Seikho commented Oct 7, 2023

Can this support the same generation parameters that are used in the existing example server?

@ggerganov
Owner

Here is a 30B LLaMA Q4_0 serving 4 clients on M2 Ultra:

server-parallel-1.mp4

(I tried 7B and 13B, but the generation is so fast that I am not able to start the requests in parallel)

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

(I tried 7B and 13B, but the generation is so fast that I am not able to start the requests in parallel)

I want to add the option to cancel the streaming, but when I use AbortController in the frontend it causes an error. I'm thinking of adding a GET endpoint stop?slot=id to request releasing the sequence.

@ggerganov
Owner

And one more example using the original server UI with 70B Q8_0 model:

server-parallel-2.mp4

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

And one more example using the original server UI with 70B Q8_0

Are you working on implementing that UI, or did you just reuse the endpoints?

@ggerganov
Owner

ggerganov commented Oct 8, 2023

Are you working on implementing that UI

No, I just noticed it works and gave it a try. I thought you implemented it.

I want to add the option to cancel the streaming, but when I use AbortController in the frontend it causes an error. I'm thinking of adding a GET endpoint stop?slot=id to request releasing the sequence.

What kind of error?

@AdityaSher

How does this work? Do the instances share the same weights, or does the model need to be loaded N times?

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

How does this work? Do the instances share the same weights, or does the model need to be loaded N times?

It just divides the KV cache (context size) among a number of sequences; the limit is the context size, because it is shared between clients.

It does not reload the model; the model is loaded only once.
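By that description, with the command above (--ctx_size 2048 and --parallel 3), each slot would get roughly 2048 / 3 ≈ 682 tokens of the shared context.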

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

No, I just noticed it works and gave it a try. I thought you implemented it.

I only reused the completion.js function. 😂

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

What kind of error?
Screenshot 2023-10-08 173203

let controller = null;

// wrapped in a function so that `await` is valid
async function complete(options) {
  controller = new AbortController();
  const response = await fetch("http://localhost:8080/completion", {
    method: "POST",
    body: JSON.stringify(options),
    headers: {
      Connection: "keep-alive",
      "Content-Type": "application/json",
      Accept: "text/event-stream",
    },
    signal: controller.signal,
  });
  return response;
}

function cancel() {
  if (controller) {
    /* Anyway, even though I abort it, the slot doesn't release; it continues
       to generate because the stream doesn't receive the signal that the
       connection was closed.
       Easy fix: create a stop endpoint to notify the slot to release. */
    controller.abort(); // when I call this function I get a DOMException
  }
}

@KerfuffleV2
Collaborator

when I use AbortController in the frontend it causes an error

"Note: When abort() is called, the fetch() promise rejects with a DOMException named AbortError." — https://developer.mozilla.org/en-US/docs/Web/API/AbortController

So that's probably normal. You can possibly try to catch the exception, like in the example there. I wouldn't think that would cause the connection not to abort properly though, so it not stopping generation on the server side is probably a different problem.
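Something like this minimal sketch (reusing the fetch call from the snippet above) would catch the expected rejection; note that the DOMException surfaces on the fetch() promise rather than on abort() itself:

let controller = null;

async function completeWithCancel(options) {
  controller = new AbortController();
  try {
    const response = await fetch("http://localhost:8080/completion", {
      method: "POST",
      body: JSON.stringify(options),
      signal: controller.signal,
    });
    return await response.text(); // or consume the event stream here
  } catch (err) {
    if (err.name === "AbortError") {
      // Expected when cancel() runs; swallow it instead of surfacing
      // the DOMException to the UI.
      return null;
    }
    throw err; // anything else is a real error
  }
}

function cancel() {
  if (controller) controller.abort();
}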

@FSSRepo
Collaborator Author

FSSRepo commented Oct 8, 2023

so it not stopping generation on the server side is probably a different problem.

I will fix it by adding a GET endpoint, stop_completion, to notify the server to release the slot.
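The client side of that could look something like this sketch; the stop_completion path and the slot query parameter are only the names proposed in this thread, not an implemented API:

// Hypothetical: /stop_completion and its "slot" parameter are proposed
// names from this thread, not an existing endpoint.
async function stopCompletion(slotId) {
  await fetch(`http://localhost:8080/stop_completion?slot=${slotId}`);
}

// usage: stopCompletion(0) after aborting the stream on the client side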

@FSSRepo
Collaborator Author

FSSRepo commented Oct 9, 2023

@ggerganov, can you merge the latest changes from master into this branch, please? I'm afraid of making a mistake again.

@ggerganov
Owner

This example is very nice to have, but I'm wondering whether we should instead implement the functionality directly in the existing server example. I'm not sure I'll have time to do it myself though, and I don't want to delay this for too long. Also, the current implementation keeps one CPU core at 100% all the time, which is not desirable.

I'll give this PR some more time to see if people would be interested in improving this. If not, we will probably merge this and add an item on the roadmap for the future.

@FSSRepo
Collaborator Author

FSSRepo commented Oct 11, 2023

This example is very nice to have, but I'm wondering whether we should instead implement the functionality directly in the existing server example. I'm not sure I'll have time to do it myself though, and I don't want to delay this for too long. Also, the current implementation keeps one CPU core at 100% all the time, which is not desirable.

I'll give this PR some more time to see if people would be interested in improving this. If not, we will probably merge this and add an item on the roadmap for the future.

I will do it; I just want you to push the latest changes from master into my fixes branch, since I don't know how to do that. Or if you teach me, I will be glad.

@cebtenzzre
Collaborator

Or if you teach me, I will be glad.

You can add this repo as a remote in your fork (git remote add upstream https://github.com/ggerganov/llama.cpp.git) and then pull from it with your 'fixes' branch checked out (git pull --no-rebase upstream master).

@FSSRepo
Collaborator Author

FSSRepo commented Oct 11, 2023

@cebtenzzre Thank you so much!

@FSSRepo
Collaborator Author

FSSRepo commented Oct 13, 2023

@ggerganov "Do you want me to close this PR? I initially proposed this as a simple example to avoid the complexity of the server example.

@ggerganov
Owner

@FSSRepo Let's see if we can make the other PR work, and if so, we will close this one.

@ggerganov
Owner

Superseded by #3677.

@ggerganov ggerganov closed this Oct 22, 2023
@FSSRepo FSSRepo deleted the fixes branch October 22, 2023 20:01